Amazon SageMaker Debugger

Paper: https://assets.amazon.science/0b/cb/47bb9a1e4b6a8f78ed7a7611f4a7/amazon-sagemaker-debugger-a-system-for-real-time-insights-into-machine-learning-model-training.pdf?fbclid=IwAR2p_Jxj4CJTA7ESs4_DhoTYveGyNfHLRVk3Zimfb-Vd4W_bzkNkgHpz7MM
SageMaker Debugger
- identifies and stops underperforming training jobs
- framework agnostic
- detects common training issues
  - vanishing or exploding gradients
  - neuron saturation
  - overfitting
- open-source library: "smdebug"
Lifecycle
- data prep
- model training – iterative
  - monitor and stop early in case of issues
- hyperparameter tuning
- deployment
  - data drift, etc.
SageMaker's approach
- smdebug: record and load tensors
- rules to analyze a job
- instrumentation
  - pytorch forward hooks, etc.
  - smdebug wraps the apis
  - containers are pre-modified
  - specify tensors by regex
  - recorded as protobuf files (tensorboard?) – analyze with smdebug
- rules analyze the written tensors; rule status changes surface as CloudWatch events, which can trigger early stopping
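The instrumentation idea above (hook into the training loop, select tensors by regex, write only every N steps) can be sketched in plain Python. This is not the smdebug API: `TensorRecorder`, `include_regex`, and `save_interval` are hypothetical names, and an in-memory dict stands in for the protobuf files.

```python
import re


class TensorRecorder:
    """Toy stand-in for a training hook (hypothetical, not the smdebug API)."""

    def __init__(self, include_regex, save_interval=1):
        self.patterns = [re.compile(p) for p in include_regex]
        self.save_interval = save_interval
        self.saved = {}  # step -> {tensor name: copy of values}

    def should_save(self, step, name):
        # Save only on interval steps, and only tensors whose name matches.
        on_interval = step % self.save_interval == 0
        return on_interval and any(p.search(name) for p in self.patterns)

    def record(self, step, name, values):
        if self.should_save(step, name):
            self.saved.setdefault(step, {})[name] = list(values)


recorder = TensorRecorder(include_regex=[r"^gradient/", r"loss$"], save_interval=10)
for step in range(25):
    # Values a framework hook would hand over at each step (made up here).
    recorder.record(step, "gradient/fc1.weight", [0.1 * step, -0.2])
    recorder.record(step, "fc1.bias", [0.0])
    recorder.record(step, "train_loss", [1.0 / (step + 1)])

print(sorted(recorder.saved))      # steps that were written
print(sorted(recorder.saved[10]))  # tensor names kept at step 10
```

Only steps on the save interval are kept, and `fc1.bias` is dropped because it matches neither pattern.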
Scaling
- offload into separate containers
- optimizations: sampling, aggregations, save intervals
- store tensors separately from the training compute
  - allows for handling data volume
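One way to read "aggregations" is storing a few scalar statistics per tensor instead of the full tensor. A minimal sketch, assuming tensors arrive as flat lists of floats; `reduce_tensor` and the chosen statistics are illustrative and not necessarily what smdebug computes.

```python
import math


def reduce_tensor(values):
    """Summarize a flat list of floats with a few scalar statistics."""
    n = len(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / n,
        "l2_norm": math.sqrt(sum(v * v for v in values)),
    }


gradient = [3.0, -4.0, 0.0, 1.0]
summary = reduce_tensor(gradient)
print(summary)
```

The summary has a fixed size regardless of how large the tensor is, which is the point of the optimization.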
Rules
- measure imbalance
- check that inputs are normalized correctly (zero mean, unit variance)
- activation functions
  - neurons suffer saturation – leading to vanishing gradients
  - dying relu
  - fixed by scaling to allow for symmetric initialization
- loss
  - loss not decreasing
  - overfitting
  - underfitting
- tensors
  - all zeros
  - all small
  - values unchanging
- parameter initialization – check properties
- thresholds for gradients
- compare weight updates with gradient
- overfitting: eval loss exceeds train loss at some point during training
- xgboost – large or shallow trees
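A few of the rules above can be sketched as small checks over recorded values. This is a rough illustration in plain Python, not the built-in rule implementations: all function names, thresholds (`patience`, `min_delta`, `low`, `high`, `ratio`), and inputs are assumptions chosen for the example.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def dead_relu_fraction(activations):
    """Fraction of ReLU outputs that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)


def loss_not_decreasing(losses, patience=3, min_delta=1e-3):
    """True if none of the last `patience` losses beats the earlier best by min_delta."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return all(x > best_before - min_delta for x in losses[-patience:])


def gradient_status(grad_abs_mean, low=1e-7, high=1e3):
    """Classify a mean absolute gradient against illustrative thresholds."""
    if grad_abs_mean < low:
        return "vanishing"
    if grad_abs_mean > high:
        return "exploding"
    return "ok"


def overfitting(train_losses, eval_losses, ratio=1.0):
    """True if eval loss exceeds ratio * train loss at any recorded step."""
    return any(e > ratio * t for t, e in zip(train_losses, eval_losses))


print(imbalance_ratio([0] * 90 + [1] * 10))
print(dead_relu_fraction([0.0, 0.0, 0.5, 0.0]))
print(loss_not_decreasing([1.0, 0.8, 0.7, 0.7, 0.71, 0.7]))
print(gradient_status(1e-9), gradient_status(5e4), gradient_status(0.01))
print(overfitting([0.9, 0.5, 0.3], [1.0, 0.6, 0.5], ratio=1.5))
```

Each check takes only recorded tensors or scalars as input, so it can run outside the training process.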
Applications
- additional visualizations to tune the model
- iterative model pruning – stop the model before loss stops
- model understanding
Case study
- saving tensors every 10 steps causes a 1.9x slowdown
- but saving every 200 steps keeps the slowdown within 1.2x
Highlights
- easier tensor access without modifying training
- can modify access while training is in progress
- rule analysis is out of band
Follow-up resources
- evaluate smdebug source code
- imbalance ratio – Johnson & Khoshgoftaar – size of largest class / size of smallest class
- salience maps
- tensorwatch
  